feat(pipeline): harvest related legislation discovered during enrichment#901
Conversation
Regelingen reached only through the machine-readable model (source.regulation, legal_basis, and open_terms/implements delegations like regeling_standaardpremie) are never auto-harvested: the recursive harvester follows only <extref> BWB hyperlinks in the source text, and those model links are a product of enrichment. The enrichment agent now returns the related legislation it needs in a result envelope (.enrichment-result.yaml) — deliberately kept OUT of the law YAML so the law artifact stays schema-conformant. After a successful enrich the worker reads the envelope, resolves each entry to a BWB id (agent bwb_id -> law_entries slug -> single-hit SRU title search), and enqueues a follow-up harvest for each. Newly harvested laws auto-enrich and return their own related legislation, so the dependency graph fills itself by recursion. - EnrichPayload.depth carries recursion depth (harvest -> enrich -> harvest); each related harvest is depth+1 at priority 40-(depth+1), so deeper nesting yields to shallower/interactive harvests. - Opt-in: HARVEST_RELATED_LEGISLATION (off by default) + RELATED_HARVEST_MAX_DEPTH (default 2); reuses ENRICH_DAILY_LIMIT for spend and create_harvest_job_if_not_exists for dedup. Best-effort: nothing here can fail the already-committed enrichment. - search_bwb_by_name extracted from the axum handler for reuse; find_bwb_id_by_slug made pub; slug hits re-validated as BWB (CVDR skipped). RFC-025 documents the pattern and its known limitations.
Drop the HARVEST_RELATED_LEGISLATION opt-in flag. The follow-up harvest only enqueues harvest jobs (no LLM cost); the expensive re-enrichment of those laws stays gated by ENRICH_AUTO_ENQUEUE + ENRICH_DAILY_LIMIT, and the recursion is bounded by RELATED_HARVEST_MAX_DEPTH. So there is nothing to protect behind a separate flag.
The enrichment result envelope (.enrichment-result.yaml) and the existing .enrichment.yaml metadata sidecar are written into a law directory but are not law files. `find -name '*.yaml'` matches leading-dot names and the pre-commit `files:` regex matches them too, so validating one fails (missing $id). Skip dot-prefixed sidecars in both script/validate.sh and the validate-law-yaml hook.
There was a problem hiding this comment.
This PR adds a feedback loop that lets the enrichment agent declare the related legislation it just modeled, then enqueues follow-up harvests so the dependency graph fills itself in recursively. The core design (sidecar envelope, hybrid BWB resolution, depth/priority capping, best-effort error handling) is sound, and the unit tests cover the main serialization and resolution paths well.
Two issues found in the new resolution logic:
🟠 Significant — CVDR slug hit falls through to SRU name search
packages/pipeline/src/worker.rs, in resolve_related_bwb_id
When step (b) finds the slug in law_entries but the stored id is a CVDR id (not a BWB id), the code logs "skipping" but does not return — execution falls through to step (c), the SRU name search on wetten.overheid.nl.
Ok(Some(id)) if is_valid_bwb_id(&id) => return RelatedResolution::Resolved(id),
Ok(Some(id)) => {
tracing::debug!(slug = %slug, resolved = %id, "slug resolved to a non-BWB id; skipping");
// no return here — falls through to SRU search
}law_entries does store CVDR ids (confirmed — the harvest worker calls upsert_law with the CVDR job.law_id). A law named e.g. "Verordening X" can have a slug hit mapping to a CVDR id, then a single SRU hit for a different national law with a similar name, producing Resolved(wrong_bwb_id) where Unresolved was intended. The comment itself says "skip non-BWB hits" — the code should do exactly that:
Ok(Some(id)) => {
tracing::debug!(slug = %slug, resolved = %id, "slug resolved to a non-BWB id; skipping");
return RelatedResolution::Unresolved; // add this
}🟡 Minor — resolved counter inflated by exhausted-skipped laws
packages/pipeline/src/worker.rs, in harvest_related_legislation
resolved += 1 fires before the HarvestExhausted guard. When a law is exhausted, the loop continues without incrementing enqueued, so the summary log silently folds both "exhausted-skipped" and "already-queued (dedup)" into the resolved − enqueued gap with no way to distinguish them.
resolved += 1; // ← incremented here
if let Ok(law) = law_status::get_law(pool, &bwb_id).await {
if law.status == LawStatusValue::HarvestExhausted {
continue; // enqueued not incremented — gap is ambiguous
}
}Moving resolved += 1 to after the exhausted check, or adding a dedicated exhausted counter to the summary log, would make the log actionable.
🟡 Minor — blocking I/O in async context
packages/pipeline/src/enrich.rs:438
read_enrichment_result_envelope uses std::fs::read_to_string (synchronous) and is called directly from the async execute_enrich_with_runner. The surrounding code already uses tokio::fs::write. In practice the file is tiny and arrives after a minutes-long LLM call, so the impact is negligible on normal disk — but under a slow filesystem (NFS, container overlay under memory pressure) this blocks the Tokio worker thread indefinitely. Wrapping in tokio::task::spawn_blocking or switching to tokio::fs::read_to_string would match the pattern used elsewhere in the file.
Address CI review findings: - A CVDR slug hit no longer falls through to the SRU name search (the slug already identified the law; a title match could resolve a *different* national law). It now returns Unresolved. - The harvest summary log separates already_queued and exhausted skips instead of conflating them in the resolved-but-not-enqueued gap. Also drop RFC-025: the related-legislation harvest loop is an implementation detail, not a cross-cutting design decision that warrants an RFC.
|
Both review findings addressed in 4a24029: the CVDR slug hit now returns |
There was a problem hiding this comment.
Pipeline: harvest related legislation discovered during enrichment
Correctness
🟡 Minor — SRU result not validated as BWB ID (inconsistency with paths a and b)
packages/pipeline/src/worker.rs — the resolve_related_bwb_id function resolves via three paths:
- (a) explicit
bwb_id: guarded byis_valid_bwb_id - (b) slug lookup: guarded by
is_valid_bwb_id - (c) SRU single hit:
results[0].bwb_id.clone()returned directly, no validation
BwbSearchResult.bwb_id comes from parse_sru_response which parses wetten.overheid.nl SRU output — in practice that's always a BWBR ID — but the inconsistency means a malformed response would propagate an invalid ID into HarvestPayload.bwb_id. Since the harvest is best-effort this fails silently, but it's an easy one-liner fix:
Ok(results) if results.len() == 1 && is_valid_bwb_id(&results[0].bwb_id) => {
RelatedResolution::Resolved(results[0].bwb_id.clone())
}🟡 Minor — blocking I/O in async context
packages/pipeline/src/enrich.rs:438 — read_enrichment_result_envelope calls std::fs::read_to_string, blocking the Tokio runtime thread. Every other file operation in execute_enrich_with_runner uses tokio::fs. For a tiny YAML sidecar the impact is negligible, but it's inconsistent:
// current (sync, blocks runtime thread)
let content = match std::fs::read_to_string(&envelope_path) { … }
// idiomatic in this async context
let content = match tokio::fs::read_to_string(&envelope_path).await { … }Requires making read_enrichment_result_envelope async and threading .await through the caller.
🟡 Minor — depth comment at harvest→enrich handoff is misleading
packages/pipeline/src/worker.rs — the comment at depth: payload.depth:
// Inherit the just-completed harvest's depth so a
// related-legislation harvest chain keeps counting.
depth: payload.depth,payload here is a HarvestPayload whose depth is the extref recursion counter, not the related-legislation chain counter. The comment implies only related chains inherit this value. In reality, any harvest at extref depth ≥ related_harvest_max_depth (2 by default) will have its enrichment silently skip related-legislation follow-ups — the PR description acknowledges this as a known limitation but the comment in code doesn't signal it. Worth expanding:
// Inherit the harvest's depth (extref OR related-chain). This
// conflates both depth counters — a law reached via ≥2 extref hops
// won't get its related legislation harvested. Clean-up: a separate
// field on HarvestPayload.
depth: payload.depth,Rest of the diff
bwb_search.rsextraction ofsearch_bwb_by_name: clean refactor, error mapping is correct.config.rs/ env parsing:parse().ok()silently falls back to 2 on any bad value — appropriate.enrich.rsenvelope structs: serde defaults are correct; graceful degradation (absent/malformed → empty) is correct..pre-commit-config.yaml/validate.sh: exclusion regexes are correct and consistent.SKILL.mdPhase 4.5: clear instructions; the CRITICAL warning not to put the envelope in the law YAML is exactly right.- Unit tests cover serialization, absent/malformed envelopes, clamping, and slug normalization. Good coverage.
No critical or significant issues.
There was a problem hiding this comment.
Pipeline: harvest related legislation discovered during enrichment
Correctness
🟡 Minor — SRU result not validated as BWB ID (inconsistency with paths a and b)
packages/pipeline/src/worker.rs — resolve_related_bwb_id guards path (a) and (b) with is_valid_bwb_id, but path (c) returns results[0].bwb_id.clone() directly:
Ok(results) if results.len() == 1 => RelatedResolution::Resolved(results[0].bwb_id.clone()),BwbSearchResult.bwb_id comes from parse_sru_response parsing wetten.overheid.nl SRU output — in practice always a BWBR ID — but the inconsistency means a malformed API response propagates an invalid ID into HarvestPayload.bwb_id. Since the harvest is best-effort this fails silently; easy fix:
Ok(results) if results.len() == 1 && is_valid_bwb_id(&results[0].bwb_id) => {
RelatedResolution::Resolved(results[0].bwb_id.clone())
}🟡 Minor — blocking I/O in async context
packages/pipeline/src/enrich.rs:438 — read_enrichment_result_envelope calls std::fs::read_to_string, blocking the Tokio runtime thread. Every other file operation in execute_enrich_with_runner uses tokio::fs. For a tiny sidecar the impact is negligible, but it's inconsistent with the surrounding code.
🟡 Minor — depth comment at harvest→enrich handoff understates the conflation
packages/pipeline/src/worker.rs:
// Inherit the just-completed harvest's depth so a
// related-legislation harvest chain keeps counting.
depth: payload.depth,payload here is a HarvestPayload whose depth is the extref recursion counter. The comment implies only related chains inherit this value. In reality any law reached via ≥ RELATED_HARVEST_MAX_DEPTH extref hops will silently have its related-legislation follow-ups skipped. The PR description documents this as a known limitation; the in-code comment should say so too.
Rest of the diff
bwb_search.rsextraction ofsearch_bwb_by_name: clean refactor, error mapping is correct.config.rsenv parsing:parse().ok()silently falls back to 2 on bad input — appropriate for a tuning knob.enrich.rsenvelope structs: serde defaults are correct; graceful degradation (absent/malformed → empty) is correct..pre-commit-config.yaml/validate.sh: exclusion patterns are correct and consistent with each other.SKILL.mdPhase 4.5: clear instructions; the CRITICAL warning not to put the envelope inside the law YAML is exactly right.- Unit tests cover serialization, absent/malformed envelopes, priority clamping, and slug normalization — good coverage.
No critical or significant issues.
- Validate the single-hit SRU result as a BWB id before resolving, so a malformed SRU id can't slip into a harvest payload (paths a/b already validate). - Read the .enrichment-result.yaml sidecar via tokio::fs (was blocking std::fs) for consistency with the rest of execute_enrich_with_runner. - Clarify the depth-inherit comment: the field is the shared extref-recursion counter, so deep-via-extref laws skip related discovery (roots/shallow laws, the intended case, are unaffected).
Why
Regelingen reached only through the machine-readable model — a
source.regulationbinding, alegal_basisanchor, or an open delegation (open_terms/implements, e.g.regeling_standaardpremiefillingwet_op_de_zorgtoeslag'sstandaardpremie) — are never auto-harvested. The recursive harvester follows only<extref>BWB hyperlinks in the source text, and those model links don't exist in the text — they are a product of enrichment. So the discovery must happen where the model is created: the enrichworker (which has corpus + DB). (A prior admin-side closure endpoint, #894, was closed — the deployed admin has no corpus.)What
The enrichment agent now returns the related legislation it needs in a result envelope — a
.enrichment-result.yamlsidecar the agent writes, kept deliberately out of the law YAML so the law artifact stays schema-conformant (extensible for future enrichment return-values). After a successful enrich, the worker:related_legislation,bwb_id→law_entriesslug lookup → single-hit SRU title search (>1→needs_confirmation, skip),Newly harvested laws auto-enrich (existing
ENRICH_AUTO_ENQUEUE) and return their related legislation → the dependency graph fills itself by recursion.Depth & priority
EnrichPayload.depthcarries recursion depthharvest → enrich → harvest; each related harvest isdepth+1atpriority = 40 − (depth+1), so deeper (more speculative) nesting yields to shallower/interactive harvests (Priority::newclamps0..=100; queue isORDER BY priority DESC).Safety / guards
HARVEST_RELATED_LEGISLATION(off by default) +RELATED_HARVEST_MAX_DEPTH(default 2).ENRICH_DAILY_LIMIT(spend) andcreate_harvest_job_if_not_exists(dedup);harvest_exhaustedskipped.Also
search_bwb_by_name(client, q)extracted from the axum handler (reused by both);find_bwb_id_by_slugmadepub; slug hits re-validated as BWB (CVDR skipped, so no malformed harvest)..claude/skills/law-generate/SKILL.mdinstructs writing the envelope.Known limitations
related_depthfield is the clean follow-up.enrich/{provider}branch).Testing
is_valid_bwb_id/slugify,related_harvest_priority(clamps, −1/level), depth round-trips.HARVEST_RELATED_LEGISLATION=true, (re-)enrichwet_op_de_zorgtoeslag→ confirm it writes.enrichment-result.yaml, a follow-up harvest for the delegated regeling is enqueued (depth 1, priority 39), harvested, then auto-enriched. VerifyENRICH_DAILY_LIMITstill bounds spend.